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Abstract 

The multinomial probit (MNP) model is a useful tool for describing discrete-choice data and there 
are a variety of methods for fitting the model. Among them, the algorithms provided by |Imai and] 


van Dyk (2005aI, based on Marginal Data Augmentation, are widely used, because they are efficient 


in terms of convergence and allow the possibly improper prior distribution to be specified directly on 


identifiable parameters. Burgette and Nordheim (20121 modify a model and algorithm of Imai and 


van Dyk (2005a I to avoid an arbitrary choice that is often made to establish identifiability. There is 


an error in the algorithms of Imai and van Dyk (2005aI, however, which affects both their algorithms 


and that of Burgette and Nordheim (20121. This error can alter the stationary distribution and the 


resulting fitted parameters as well as the efficiency of these algorithms. We propose a correction and 
use both a simulation study and a real-data analysis to illustrate the difference between the original and 
corrected algorithms, both in terms of their estimated posterior distributions and their convergence 
properties. In some cases, the effect on the stationary distribution can be substantial. 
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1 Introduction 


The multinomial probit (MNP) model is widely used for describing discrete-choice data in social sciences 
and transportation studies. It is often preferred over the multinomial logit model because it does not 


assume independence of irrelevant alternatives; see, e.g., Hausman and Wise (19781 for details. Moreover, 
the MNP model has a strong connection with the multiperiod probit model, for which binary choices are 


observed over multiple time periods with correlated errors (McCulloch and Rossi, 19941. 

The use of the MNP model was once restricted, because methods, like maximum likelihood estimates 


or simulated moments ( McFaddenf 19891, require evaluating high-dimensional normal integrals, which are 
typically intractable. More recently, advances in Bayesian simulations have boosted the development of 


Markov chain Monte Carlo (MCMC) algorithms for fitting the MNP model (e.g., McCulloch and Rossi 


(1994), Nobile (1998), McCulloch et al. (2000), Imai and van Dyk (2005a), and Burgette and Nordheim 


(2012)). These algorithms avoided evaluating multidimensional integrals, provided reliable model fitting, 
and thus revitalized the use of the MNP model in practice. 

Current MCMC algorithms specify a set of latent Gaussian variables as augmented data, whose relative 
magnitudes determine the choices. Since the augmented model is not identifiable given the observations. 


a proper prior distribution is required to ensure that the posterior distribution is proper. McCulloch 


and Rossi (1994) advocate a Gibbs sampler which was the first feasible Bayesian approach to fitting the 


MNP model. In their specification, however, the prior distribution for the identifiable parameters is only 


determined as a byproduct (Imai and van Dyk 2005a henceforth IvD). An improvement of McCulloch 


and Rossi (1994) is the “hybrid Markov chain” introduced by Nobile (1998), which adds a Metropolis 


step to sample the unidentihable parameters. McCulloch et al. (2000) propose another modihcation of 


McCulloch and Rossi (1994) which specifies a prior distribution directly on the identifiable parameters. 


IvD review these MCMC algorithms, compare their computational performance, and find that, first, [Nobile 


(1998) can be less sensitive to starting values than McCulloch and Rossi] (1994); second, although Nobile’s 
method significantly improves the convergence of the overall chain, the gain seems to be primary for the 
unidentifiable parameter with only slight gain for the identifiable ones; and third, although [McCullo^ 


et al. (2000) solve the problem of prior specihcation, their algorithm can be less efficient in terms of 


convergence than either McCulloch and Rossi (1994) or Nobile (1998) (This final point was also noted 


by McCulloch et al. (2000) and Nobile (2000).) Moreover, IvD point out an error in Nobile’s derivation 
which can alter its stationary distribution. Ironically, as we shall see, the algorithms of IvD also contain 
an error. 
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IvD develop new samplers based on the Marginal Data Angmentation (MDA) algorithm (Meng and 


van Dyk, 1999). The new algorithms are easy to implement becanse they only inclnde draws from standard 


distribntions. IvD demonstrate that hrst, their methods are at least as qnick as the fastest methods in 
terms of convergence, and second, the model is specihed in terms of possibly improper prior distribntions 
that are set directly on the identihable parameters, making the priors relatively easy to interpret. Becanse 
of their apparent advantages, IvD’s algorithms have been widely nsed in practice to ht MNP models; see. 


e.g., Berrett and Calder (2012), Bnrgette and Nordheim (2012), Chandoin (2014), Horinchi et al. (2007), 


Hrnschka (2007), Ln et al. (2012), Qneralt (2012), Sinclair and Whitford ( 2013| ), [Vincent et al. (2013), 


Zhang et al. (2008), etc. This snccess has been aided by a popnlar R package {MNP, Imai and van Dyk 


( M)5b| )). 

Unfortnnately, there are two errors in IvD’s algorithms; both occnr when sampling the variance- 
covariance matrix. First, IvD reparameterize the variables to facilitate the sampling of the variance- 
covariance matrix, and they make a mistake when transforming to the original parameterization. Second, 
when npdating the variance-covariance matrix, a constraint on the matrix is overlooked. These errors can 
alter the stationary distribntion and hence the htted valnes and standard errors of the model parameters. 
They also can affect the efficiency of convergence. 


Bnrgette and Nordheim (2012 henceforth BN) modify the model of IvD by changing the manner in 


which nnidentihability in the scale is addressed. In particnlar, they hx the trace of the variance-covariance 
matrix while IvD, like previons anthors, hx the hrst diagonal element. BN’s algorithm for sampling from 
the posterior distribntion bnilds npon Algorithm 1 of IvD. Thns the two errors made by IvD also affect 
BN’s algorithm. BN even make another mistake when npdating the regression coefficient parameter, /3. 
In this paper, we explain how to correct the errors in algorithms of both IvD and BN, and nse both a 
simnlation stndy and a real-data analysis to illnstrate the difference between the original and the corrected 
algorithms in terms of their estimated posterior distribntions and convergence properties. The corrections 
we propose will be implemented in the MNP R package. 

The remainder of this paper is organized into hve sections. We introdnce the MNP model in Section 
In Section we present the original algorithms in IvD and BN, point ont their errors, and provide the 
corrected algorithms. We also inclnde a short review of MDA, which is nsed by all the algorithms we 
consider. In Sections and we nse a simnlation stndy and a real-data example, respectively, to 
illnstrate the difference between the original and corrected algorithms. Conclnsion and hnal remarks 
appear in Section 
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2 Multinomial Probit Model 


We consider a (p+ l)-class multinomial model. Each observation is a binary (p+ l)-vector, di = {dn,..., 
We model di by conditioning on a latent multivariate normal variable, Ui = {Un, ■ ■ ■, 
dij is one if Uij is larger than all the other components of Ui. Specihcally, 


Ui ~ Np+i SO) and dij 



if Uij = max{Uii, C/i,(p+i)} 
otherwise 


for i = 1, 


n, (1) 


where is a ((p + 1) x q) matrix of known covariates, /3 is a q-vector of unknown parameters, and 
is a {{p + 1) X (p + 1)) unknown variance-covariance matrix. 

Model Q is unidentihable because shifting Ui by any constant or rescaling U by any positive constant, 
does not alter the distribution of di. To avoid this, IvD and BN both follow McCulloch and Rossi ( ]1994 ), 
by expressing each Uij relative to a base category (e.g., and obtain the new latent variable, 

Wi = {Wii ,..., Wip), where Wij = Uj — Ui^^p^iy The distribution of Wi is still multivariate normal, that 
is, 

( 2 ) 


where Xi = PXf and S = with P = [Ip,—J], with Ip a (p x p) identity matrix and J a column 

p-vector of ones. For simplicity, we collapse di into li, which is an integer in {0,... ,p}, dehned as 


Y^ = 


0 if max{Wii,..., Wip}< 0 
k if Wik = max{0, Wn ,..., Wip} 


for i = 1, 


,n. 


( 3 ) 


To ensure identihability, we must also set the scale. IvD adopt the standard solution of McCulloch 
and Rossi (1994); they set the hrst diagonal element of S to one, i.e., = 1. BN propose a different 


solution; they hx the trace of the variance-covariance matrix, i.e., trace(S) = p. They argue that the trace 
restriction is a better choice from three reasons. First, the trace restriction makes it possible to specify 
a symmetric prior for S that is invariant to permutations of rows and columns. Second, when using the 
variance-element restriction (as in IvD), the estimated predicted choice probabilities under the posterior 
distribution can vary largely with the choice of the category corresponding to the unit variance. The trace- 
restricted hts tend to be intermediate among the results of the p possible variance-element restricted hts. 
Third, the trace restriction yields marginal posterior distributions that are easier to interpret. 

To overcome difficulties stemming from the constraint, afi = 1, on the variance-covariance matrix. 
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motivated by McCulloch and Rossi (19941, IvD set Wi = aWi, for i = where a > 0. Then 

Wi ~ Np(Xj/3, S), where /3 = a/3 and S = a^S. Because S can be any positive-definite matrix, IvD 
specify an inverse-Wishart prior distribution, S ~ Inv-Wishart(i/, S). After transforming to c? = and 
S = S/afi, the implied prior distribution on (a^, S) is 


a^|S ~ aQtrace(S'S andp(S)oc|S| ^^''"^^^^^^[trace(5S ^)] = 1}, 


(4) 


where S = S/a^ and the first diagonal element of S is one; I is an indicator function which equals one 
when the condition in the brackets is satisfied, and zero otherwise. They also specify a normal prior 
distribution for /3, /3 ~ Ng(/3o, A). For simplicity, we set (5q = 0. BN adopt the same strategy for setting 
their prior distribution in the context of the constraint, trace(S) = p. In particular, their implied prior 
distribution for (a^, S) is almost the same as IvD except that 


p(S) oc |Sr('^+^’+^)/2[trace(5S-^)] ''^/^/{trace(S) = p}, 


(5) 


where trace(S) = p. They use the same prior distribution as IvD for /3, i.e., {3 ~ Nq(0, A). As IvD state, 
this choice of prior distribution allows both informative and diffuse priors for unknown parameters while 
maintaining simplicity and efficiency of the algorithms. 


3 MDA Algorithms for Fitting MNP Models 

3.1 Marginal Data Augmentation 

The algorithms of IvD and BN are all based on the method of MDA. To describe and correct the errors 
in these algorithms, we briefly review MDA. First, denoting (/3,S) by 9, the Data Augmentation (DA) 
algorithm ( Tanner and Won^ 19871 is designed to sample from the posterior distribution, p{9, 1F|T), by 
updating from p(W\6,Y) and p{6\W,Y) iteratively. In this section, we regard Y, 9 and W as generic 
observed data, unknown parameter of interest, and latent variables, respectively. 


Although easy to implement, the DA algorithm can be slow to converge. The MDA algorithm (Meng 


and van Dyk, 19991 improves the convergence rate of a standard DA algorithm by expanding its state 


space. Specifically, MDA introduces a working parameter, a, into the augmented-data model p{W,Y\9)] 
a is not identifiable under the observed-data model p(Y\9). An MCMC sampler is run on the expanded 
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model p(W^ y|0, a), which is designed to maintain p{Y\0) as its marginal distribution, that is, 

jp{W, Y\9, a)dW = p(Y\9). (6) 

A general method for introducing a into an augmented-data model is to use a one-to-one mapping, 

W = Fa{W), for any given a, (7) 


which is differentiable when W is continuous. For each F, there typically exists a value ao such that Fag 
is an identity function, Fag{W) = W. With this construction, the MDA algorithm proceeds by iterating 


Step 1: ~ p{W ,a\9^^\Y), 

Step 2: ~ p{9, a\W^^+^\Y). 


( 8 ) 


Note that in the sampler in (|^, a is sampled in both steps and its first update is not part of the final 
output. We define such updates as intermediate quantities and indicate them with superscript The 
sampler in (pj) is a collapsed DA sampler (Liu et al. 1994), since its two steps can be considered as 


sampling W and 9 with a integrated out. In this regard, the sampler in (|^ is equivalent to a standard 
DA sampler constructed for the conditional distributions of p{W, 9\Y). Thus the marginal Markov chain, 
{9^^\t = 0,1,... }, produced by the sampler in ([^ is reversible with p{9\Y) as its stationary distribution. 
Collapsing a out increases the (expected) variance of the conditional distributions sampled in (®. This 


enables bigger jumps and faster convergence, see Meng and van Dyk (1999) and van Dyk and Meng (20011 
for more details. 


3.2 Errors in Algorithms and the Corrections 


We refer to Algorithms 1 and 2 of IvD as Algorithms o and |2.1[ This allows us to clearly number 
the corrected versions of these algorithms. Similarly, we refer to the algorithm of BN as Algorithm |3.1| 
Algorithm |1.11 is displayed here and Algorithms 2.1 and 3.1 in Appendices [A| and [B| 

To obtain posterior samples under the MNP model. Algorithm |1.1| proceeds by sampling iteratively 
from p(a^, W\Y, /?, S), p{a‘^, /d\Y, W, S) and p{a^, S|y, W — aXjd, fd). The first of these draws is obtained 
via a sequence of conditional draws, see Step 1(b) of Algorithm |1.1[ Note that this algorithm marginalizes 

proceeds by sampling iteratively from p{a‘^,W\Y, /3,Y), p{a‘^,Tj\Y, 


a out in each step. Algorithm 


2.1 


W—aX 13, (3) and p(/3|T, W, S), again using a sequence of conditional draws for updating W. Algorithm 2.1 


6 
















Algorithm 1.1 

Step 0: Initialize parameters t = 0, and 

while t < T do 

Step 1: Update ,W*^ via p(a^ IU|y, by 

(a) sampling (a^)* from p(a^|S^*^): (a^)* ~ Ootrace /xlp'^ 

(b) sampling W* from p(IU*|y, (a^)*,/3^*^, E^*^): 

for i := 1,..., n do 
for fc := 1,... ,p do 

sampling Wt^ from p(mfe|y, Ty*.^,E^): ~ TNi^iik,rl), see Appendix [^for details; 

end for 

Set W* = a* Wt- 

end for 

Step 2: Update ^(a viap(a2,/J|y,iy*,EW) by 

(a) sampling {o?)* from p(a^|y, lU*, E^**): 

, , + +trace f^EW') 

(a ) ~ —2 , 

'^(n+h')p 


where/?= (eEi 'x, + A’l) ' (Er=i 'lU*); 

(b) sampling /3* from p{P\Y, W*, (a^)*, E*-*^): 




/3,(a")* ^X?’eW 'Xi + A- 


and setting = p* ja*. 

Step 3: Update via p(a^ E|y, TU*,/3(‘+^^) by 

(a) sampling E* from p(E|y, 


Inv-Wishart [n + u, ZiZ^ 


where Zi = W* - a*Xi/3(*+^); 

(b) setting = dh, = E7(^a(‘+^))^ and iy(*+i) = W^/a^-^+^K 

return and 

f + 1 ^ t 

end while 


does not marginalize a out when sampling /3. Algorithm |3.1| is an adaption of Algorithm o The only 
difference occurs in Step 3 when sampling (a^, S). In Algorithm 


1.1 


o? is set to the hrst element of S in 


Step 3(b), while it is set to trace(S)/p in Step 3(b) of Algorithm 


3.1 


Unfortunately, there are two errors in these algorithms, which may severely alter their stationary 
distributions, htted values, and convergence properties. In Algorithm [M] both errors are in Step 3. The 
hrst is rather simple. The transformation from (Z, to the original parameterization 

(lUh+i)^Q;h+i)j X;h+i)) should involve setting 


= {Z, + a 


for i = 1,..., n. 


(9) 


7 











Figure 1: The posterior samples of Wy, Ws, and W 34 obtained with Algorithms 1.2 and 1.3 appear in the first and 


second rows, respectively. The samples from Algorithm 1.2 not adhering to the constraint (101 are plotted in red. 


instead of = W *see Step 3(b). The correct inverse transformation is necessary to guarantee 

that the joint stationary distribution of is the target posterior distribution. 

The second problem is more subtle. When sampling S* while conditioning on T, Z, and 


Algorithm 1.1 uses Inv-Wishart(n + v, Step 3(a). This however ignores a constraint on 

S* imposed by Y and the current value of Z and j3. This constraint is on the first diagonal element of 
S*, i.e., In particular, if we set Zi (dJi) = Zi + for z = 1,... , n, the updated value 

of cr*;^ must satisfy 


max IZii (dll ), ■ ■ ■, Zip (dii)| < 0 
max |o, Zii (dll ), ■ ■ ■, Zip (dii)| = Zik (dii) if T* = A: 


if = 0 


, for i = 1,..., n. 


( 10 ) 


Thus, the conditional distribution of S* given Y, Z, and in Step 3(a) should be a constrained 

inverse-Wishart distribution. 

To illustrate the effect of the two corrections to Algorithm [T^ we compare it with two new algorithms: 

Algorithm 1.2: This is a partial correction to Algorithm o The only difference is that Algorithm 1.2 
transforms (Z,S*) to using ([^ in Step 3(b). 
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Draws from Alg. 1.3 
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Q-Q Plot of log(a^,) 



Draws from Alg. 3.2 


Draws from Alg. 3.2 


Figure 2: Quantile-quantile plots for comparing posterior draws from different algorithms in the simulation study. 
The columns correspond to five parameters, i.e., /3i, /32, log ^ log((Tii). The first row compares 

draws from Algorithms 1 1.1 1 and 1.3, the second row compares draws from Algorithms |2. 1 1 and 2.2, and the last row 
compares draws from Algorithms 


3.1 


and 3.2. 


Algorithm 1.3: This algorithm completely corrects Algorithm In particular, Steps 0, 1, and 2 of 
Algorithm 1.3 are the same as Algorithm 0 In Step 3(a), however, Algorithm 1.3 updates S* by 
sampling from a constrained inverse-Wishart distribution, that is, 


Inv-Wishart n 


j subject to the constraint in ([T^. 


2=1 


This is accomplished by simple rejection sampling; we iteratively sample from the unconstrained 


inverse-Wishart distribution until (10) is satisfied. Finally, in Step 3(b), Algorithm 1.3 sets = 




Algorithms 2.1 and 3.1 are adaptions of Algorithm Thus, both corrections affect these algorithms 
as well. The corrected versions of Algorithms |2 .1 1 and |3 .1 1 are called Algorithms 2.2 and 3.2 respectively. 
See Appendices |A] and [B] for details of these algorithms. 
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Algorithm 1.1 



t —.—I—I—I— ^ 

0 2000 6000 10000 
Iteration 





Figure 3: The sampling results of Algorithm 0 for the simulation study. The columns correspond to trace plots, 
autocorrelation plots, and histograms. The rows correspond to four parameters: /3i, /? 2 , log and log((T| 2 )- 

4 Simulation Study 


We use a simulation study to illustrate the differences in the convergence properties of Algorithms |1.H 1 . 2 , 


and 1.3, Algorithms 2.1 and 2.2, and Algorithms 3.1 and 3.2. We set n = 50, p = 2, q = 2, (3 = (—\/2,1), 
S = ^ For Xi = ^ x'^’l )) sample Xij^i (j = 1,2) from a uniform distribution on 

(—0.5, 0.5) for i = 1,..., 25, on (0.4,1.5) for i = 26,..., 50, and sample (j = 1, 2) from a uniform 
distribution on (—1,1) for z = 1,..., 25, on (0.8, 3) for i = 26,..., 50. We specify the prior distribution of 
S and as in Section]^ with u = p, = i', and S = Diag(l, 1), and for P, as f3 ^ Ng[0, Diag(100,100)]. 
For each algorithm, we run a chain of length 15,000 and discard the first 5,000 draws. 

Figure presents the posterior samples of VFy, VFg, and IF 34 obtained with Algorithms 1.2 and 1.3 


respectively. The draws obtained with Algorithm 1.2 that do not adhere to the constraint (10) are colored 


in red, which illustrates the second problem of Algorithm |1.1| described in Section 3.2 Such draws are 
rejected in Step 3(a) of Algorithm 1.3. 

Most importantly. Algorithms o (or 1.2), |2.1[ and |3.1| do not return draws from the target poste- 
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Algorithm 1.3 



—.—I—I—I—^ 

0 2000 6000 10000 
Iteration 
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Iteration 






o 

o 



in 


-25 


-15 


-5 0 




Figure 4: The sampling results of Algorithm 1.3 for the simulation study. The columns correspond to trace plots, 
autocorrelation plots, and histograms. The rows correspond to four parameters: /3i, /I 2 , log and log((T| 2 )- 


rior distribution. Figure shows the quantile-quantile plots of parameters to compare the stationary 
distributions of original and corrected algorithms. The hrst row of Figure compares Algorithms o 
and 1.3. The distributions of (5 differ slightly for the two algorithms, while the distributions of S dif¬ 
fer signihcantly, especially the correlation parameter, pi 2 = cr 12 /{<^ 11 ^ 22 )■ For Algorithms 1.2 and 1.3 
(not shown), the distributions of f3 are again similar, while the distributions of S again differ, but not 
as severely as Algorithms o and 1.3. The second row shows the quantile-quantile plots that compare 
Algorithms |2. 1| and 2.2. The distributions of both (5 and S are slightly different for the two algorithms. 
The last row of Figure [^compares Algorithms 3.1 and 3.2. The distributions of j3 are rather similar for 
the two algorithms, while the distributions of S are different, particularly pi 2 and 022 - 

Figures and 1^ show the sampling results of Algorithms o and 1.3 respectively. The columns in 
both hgures correspond to trace plots, autocorrelation plots, and histograms. The rows correspond to four 


parameters, namely, /3i, ^ 2 , log and log(iT 22 )- Algorithm 1.3 has better convergence properties 

than Algorithm 1 1.1 1 for all the four parameters in terms of mixing and autocorrelation. The convergence of 
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Alg. 1.1 

Alg. 1.2 

Alg. 1.3 

Alg. 2.1 

Alg. 2.2 

Alg. 3.1 

Alg. 3.2 

(3i 

0.693 

1.592 

1.610 

1.575 

2.111 

1.935 

2.573 

/32 

0.443 

1.584 

1.305 

1.384 

1.342 

2.492 

3.413 

log (frfi) 

- 

- 

- 

- 

- 

1.445 

1.351 

i“g(:T;) 

0.060 

0.625 

0.865 

0.782 

1.070 

0.768 

1.163 

log (crh) 

0.506 

1.334 

0.703 

1.474 

0.823 

1.261 

1.216 


Table 1: Effective sample size per second for each of five parameters, i.e., /3i, /32, log((jJ;^), log and 

log ((TI 2 ) in the simulation study, as obtained with Algorithms 1 1. l| -l .3, Algorithms |2 . l| -2 .2, and Algorithms |3.l| -3. 2 
respectively. 


Algorithm 1.2 is better than Algorithm|l.l[ but not as good as Algorithm 1.3. Moreover, Algorithms 2.2 


and 3.2 have slightly better convergence properties than Algorithms 2.1 and 3.1 respectively. We omit 


the corresponding plots for Algorithms 1.2, 2.1, 2.2, |3.H and 3.2 to save space. 

To further compare the convergence properties of these algorithms, we compute the effective sample 
size (ESS), dehned by 

ESS(0) = ^ (11) 

where T is the total posterior sample size and pt{0) is the lag-t autocorrelation of the parameter 9. ESS 
gives an estimate of the equivalent number of independent iterations that a Markov chain represents, 


and it indicates how well the chain mixes, see Kass et al. (19981 and Liu (20011. We use the function 
“effectiveSize” in the R package coda to calculate ESS. To account for the required CPU time, we compare 
ESS per second of these algorithms. The larger the value, the more efficient is the algorithm. The 
hrst three columns in Table present the ESS per second for each parameter for Algorithms [13^1.3, 
respectively. The fourth and hfth columns correspond to Algorithms |2.1| and 2.2, and the last two 
columns correspond to Algorithms |3.1| and 3.2. We hnd that in terms of ESS per second, Algorithm 1.3 is 
more efficient than Algorithm ] 1.1 [ even though it is more computationally demanding per iteration, and it 
performs similarly to Algorithm 1.2. Algorithm 2.2 performs similarly to Algorithm|2.1[ and Algorithm 3.2 


performs slightly better than Algorithm 3.1 
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Figure 5: Quantile-quantile plots for comparing Algorithms 1.1 and 1.3 in the margarine-purchase data analysis. 


5 Data Analysis 


For a further comparison of the algorithms, we consider a data set describing margarine purchases which is 
available in the bayesm package of R. Following BN, we limit analysis to purchases of six brands: “Parkay 
stick”, “Blue Bonnet stick”, “Fleischmanns stick”, “House brand stick”, “Generic stick”, and “Shedd Spread 
tub”, and only consider the first purchase of one of these brands for each household. This results in a 
dataset consisting of n = 507 observations. 

We set “Parkay stick” as the base category, and p = 5. Again following BN, we set up a model that 
only includes intercept terms for the other five categories and a coefficient for log prices. Thus q = 6, and 
Xi = [/p,( 7 j], where Ip is the identity matrix and Qi is the p-vector of differences in log prices between 
each category and the base. We again specify the prior distribution for S and as in Section with 
v = p, Uq = u, and S = Diag(l,..., 1), and for P, as /3 ^ Nq[0, Diag(100,..., 100)]. When implementing 
Algorithms 1.3, |2.1| and 2.2, we set the variance corresponding to “Blue Bonnet stick” as one. For 
each algorithm, we run a chain of length 300,000, discard the first 100,000 draws, and thin the rest draws 
by 10. In this way we obtain 20,000 draws from each algorithm. 


13 
















Q-Q Plot of Ps 


Q-Q Plot of P3 


Q-Q Plot of p4 


Q-Q Plot of Pe 




Draws from Alg. 2.2 

Q-Q Plot of log(a^,) 


Q-Q Plotof log(( 1 +P23)/(1-P23)) Q 




Draws from Alg. 2.2 

Q-Q Plot of logCojj) 



Draws from Alg. 2.2 
-Q Plotof log ((1 +P 24)/(1 


P24)) 




Draws from Alg. 2.2 

Q-Q Plot of logCojj) 


Draws from Alg. 2.2 
-Q Plot of log(( 1 +pi4)/(1 


P14)) 




Draws from Alg. 2.2 
Q-Q Plot of log((1+p34)/(1 


Draws from Alg. 2.2 
Q-Q Plot of log ((1 +P 35)/(1 -P35)) 




Draws from Alg. 2.2 


Draws from Alg. 2.2 


Draws from Alg. 2.2 


Draws from Alg. 2.2 


Figure 6: Quantile-quantile plots for comparing Algorithms 2.1 and 2.2 in the margarine-purchase data analysis. 


Figures and [^present the quantile-quantile plots of selected parameters correspondingly sampled 
with Algorithms |1.1| and 1.3, Algorithms |2.1| and 2.2, and Algorithms |3.1| and 3.2 respectively. The pa¬ 
rameters we consider are /32, ^ 3 , Pa, Pe, log(o-fi), log{al 2 ), logl^lg), log , log (F^) , log (F^) > 

log ^ Fp 34 ) ’ ^Fpft) ■ They are selected because their stationary distributions show relatively ob¬ 

vious difference for all three pairs of original and corrected algorithms. We hnd that Algorithms 0113 
and |3.1| all fail to deliver draws from the target posterior distribution. The situation is most substantial for 
Algorithm o Moreover, in terms of autocorrelation. Algorithm 1.3 performs substantially better than 
Algorithm o while Algorithms 2.2 and 3.2 perform similarly as Algorithms |2.1| and |3.1| respectively. In 


addition. Algorithms 1.3, 2 . 2 , and 3.2 take around 15% more computational time than Algorithms 1.1 2.1 


and 3.1 respectively. 


6 Conclusion 


The algorithms of IvD and BN are implemented in the popular R package MNP and are widely used 
for htting MNP models. We point out errors in these algorithms and propose corrections. Using both a 
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Figure 7: Quantile-quantile plots for comparing Algorithms 3.1 and 3.2 in the margarine-purchase data analysis. 


simulation study and a real-data analysis, we illustrate the difference between the original and corrected 
algorithms. From these analyses, we find that the errors can significantly affect the final results, especially 
in that they alter the stationary distribution and hence the fitted parameters. Considering the popularity 
of these algorithms, it is important that they are corrected. We have done so here and will do it soon 
in the MNP package. The corrected algorithms require some what more computational time due to the 
additional rejection sampling steps, however, the extra computational time is small and at least in some 
cases it is made up by the improved autocorrelation of the corrected algorithms. 
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Algorithm 2.1 

Step 0: Initialize parameters t = 0, and 

while t < T do 

Step 1: Update ((q^)*,Z) from p(a^, Z| Y,by 

(a) sampling (a^)* from p(a^|E*'*^): (a^)* ~ Ootrace ^ /Xvp', 

(b) sampling Z from p(Z| Y, (a^)*,E*^‘^): 
for i ~ 1,... ,n do 

for fc := 1,... ,p do 

sampling IY*fc via p(IYife|Y, W*.*., E^‘^): W**. ~ TN(pifc,ril), see Appendix for details; 

end for 

Set Zi = a*(IY* -Xi/3W). 

end for 

Step 2: Update via p(a^ E|Y, Z, by 

(a) sampling E* from p(E| Y, Z, 


' Inv-Wishart 


n + /z, ^ ^ ZiZi 


(b) setting = dh, = EV(^a(‘+^^)^ and = (Zi + 

Step 3: Update via p(^| Y, 




N, 




where/3 = + A"') ' (EHi 

return and 

t + 1 ^ t 

end while 


APPENDIX: Details of Algorithms |2.1f -3.2 


A Algorithms 2.1 and 2.2 


Algorithm 2 of IvD does not marginalize a out when updating /3. We call this algorithm Algorithm 2.1 


in this paper. Algorithm 2.1 can be used when the prior mean of /3, /3o, is not equal to zero, while 
Algorithm o can not. 

The error arises in Step 2(a), which is the same as the error in Step 3(a) of Algorithm o Thus the 
correction to Algorithm 2.1 is 


Algorithm 2.2: Steps 0, 1, and 3 of Algorithm 2.2 are the same as Algorithm |2.1[ In Step 2(a), however. 
Algorithm 2.2 updates S* by sampling from a constrained inverse-Wishart distribution, that is. 


S* ~ Inv-Wishart j n -|- zz, ZiZj j subject to the constraint in (10) 


2=1 
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Algorithm 3.1 

Step 0: Initialize parameters t = 0, and 

while t < T do 

Step 1: Update ,W*^ via p(a^, IU|y,by 

(a) sampling (a^)* from p(a^|S^*^): (a^)* ~ Ootrace /xlp'^ 

(b) sampling W* from p(IU*|y, (a^)*,/3^*^, E^*^): 

for i := 1,..., n do 
for fc := 1,... ,p do 

sampling Wt^ from p(mfe|y, Ty*.^,E^): lU^ ~ TNi^iik,rl), see Appendix [^for details; 
end for 

Set W* = a*W*. 

end for 

Step 2: Update ^(a viap(a2,/J|y,iy*,EW) by 

(a) sampling {o?)* from p(a^|y, lU*, E^**): 


{a^ 


Er=i \iy/-Xi/3) + /3TA-i/3 +trace (^eW 

^(n+h')p 


where/?= (eEi 'x, + A’l) ' (Er=i 'lU*); 

(b) sampling /3* from p{P\Y, W*, (a^)*, E*-*^): 




/3,(a")* ^X?’eW 'Xi + A- 


and setting = p* ja*. 

Step 3: Update via p(a^ E|y, TU*,/3(‘+^^) by 

(a) sampling E* from p(E|y, 


Inv-Wishart [n + u, ZiZ^ 


where Zi = W* - Q*Xi/ 3t*+^); 

(b) setting = y^trace(E*/p), = E*/(^a(*+l))^ = P*/a^^+^'>, and iy(*+i) = iy*/a(*+i). 

return E(*+i), and iy(*+i) 

i -|- 1 — t 

end while 


Note that in Zj of the constraint (10) should be replaced by in Algorithm 2.2. 


B Algorithms 3.1 and 3.2 


We call the algorithm of BN Algorithm 3.1 in this paper. Algorithm 3.1 is almost the same as Algo¬ 


rithm 0 The only difference is Step 3(b). Specihcally, hrst, in Algorithm |3.l[ in this step is set to 
trace(I])/p, while in Algorithm |l.l[ o? is set to the first element of S; second, Algorithm 
in Step 3(b), while Algorithm o not. 


3.1 


sets (3 = P/a 


Besides applying the two corrections to Step 3 of Algorithm 3.1 we further remove “/jh+i) = q-(*+i) 
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in Step 3(b) of Algorithm 3.1 because we update S* conditioning on (y, Z,not on {Y,W*, P*). 
Thus, we get 


Algorithm 3.2: This algorithm completely corrects Algorithm 3.1 In particular, Steps 0, 1, and 2 of 


Algorithm 3.2 are the same as Algorithm 3.1 In Step 3(a), however. Algorithm 3.2 updates S* by 
sampling from a constrained inverse-Wishart distribution, that is. 


S* ~ Inv-Wishart j n + ZiZj' j subject to the constraint (10*). 


i=l 


Constraint (10*) is an adaption of (10) by replacing with r = 'y/trace(S*/p). Specifically, 


Zi{r) = Zi + for z = 1,... , n. The updated value of r must satisfy 


max < Z; 


i(r),... ,Zip(r)| < 0 


if Ti = 0 


max 0, Zii{r), ..., Zip(r) ^ = Zik{r) if Yi = k 


, for i = 1,..., n. 


(10*) 


Finally, in Step 3(b), Algorithm 3.2 sets aO+i) = ^ trace(S*/p), = S*/(a(*+^))^, and 

^(t+i) ^ (^. + a(i+i)Xi/30+i))/a(*+i). 

C Details of Sampling W in Step 1(b) of Algorithms |l.l| -3.2 

Updating W in Step 1(b) of Algorithms 1 1. l| -3 .2 consists of sampling from a series of univariate truncated 
normal distributions, that is, for z = 1,..., n and k = 1,... ,p, 


iy*,~TN(^,fc,4), 


where = X,,/30)+S« fcS«;L',(IT,*_,-X,,_,/3W) with IT*., = (w,l,..., ■ • •, 


and = (zxS) - ■ 


ik — \^kk I ~ ^k-k^-k-k^-k,k^ is the kih. row of Xi, and A* _fc is the sub-matrix of Xi with 
Xik removed. The constraint on W*f, is, > max{0, if Yi = k] < 0, if T) = 0; and 

IT* < max{0, IT*.}, if T* = j / k. 

If the constraint on W*j^ has the form, > w, and w < 0, we update with simple rejection 

sampling: we iteratively sample from the unconstrained normal distribution until W*j^ > w is satisfied. If 


bUjfc > w, but zn > 0, we update with the exponential rejection sampling proposed by Robert (1995). 
If the constraint on W*j^ has the form < w, we can apply the above sampling scheme with slight 
adaption, since —> —w. 
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